Learning with noisy labels (LNL) is a classic problem that has been extensively studied for image tasks, but much less so for video. Migrating image methods directly to video without considering video-specific properties, such as computational cost and redundant information, is not a sound choice. In this paper, we propose two new strategies for video analysis with noisy labels: 1) a lightweight channel selection method dubbed Channel Truncation for feature-based label noise detection, which selects the most discriminative channels to split clean and noisy instances in each category; 2) a novel contrastive strategy dubbed Noise Contrastive Learning, which constructs the relationship between clean and noisy instances to regularize model training. Experiments on three well-known benchmark datasets for video classification show that our proposed tru{\bf N}cat{\bf E}-split-contr{\bf A}s{\bf T} (NEAT) significantly outperforms the existing baselines. By reducing the feature dimension to 10\% of its original size, our method achieves over 0.4 noise detection F1-score and 5\% classification accuracy improvement on the Mini-Kinetics dataset under severe noise (symmetric-80\%). Thanks to Noise Contrastive Learning, the average classification accuracy improvement on Mini-Kinetics and Sth-Sth-V1 is over 1.6\%.
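The idea of truncating channels before splitting clean and noisy instances can be illustrated with a minimal sketch. The abstract does not state the selection criterion or the splitting rule, so everything here is an assumption made for illustration: channels are ranked by variance within one category, and instances are flagged by their distance to the per-channel median, with a midpoint threshold. The function names `select_channels` and `split_clean_noisy` are hypothetical and are not taken from NEAT.

```python
from statistics import median, pvariance


def select_channels(features, keep_ratio=0.1):
    """Return indices of the kept channels for one category.

    Assumption: "most discriminative" is approximated by highest variance.
    """
    n_channels = len(features[0])
    k = max(1, round(n_channels * keep_ratio))
    variances = [pvariance([f[c] for f in features]) for c in range(n_channels)]
    ranked = sorted(range(n_channels), key=lambda c: variances[c], reverse=True)
    return sorted(ranked[:k])


def split_clean_noisy(features, keep_ratio=0.1):
    """Flag instances that lie far from the category centre in the kept channels."""
    channels = select_channels(features, keep_ratio)
    centre = [median(f[c] for f in features) for c in channels]
    dists = [
        sum((f[c] - m) ** 2 for c, m in zip(channels, centre)) ** 0.5
        for f in features
    ]
    # Assumption: a midpoint threshold stands in for a proper two-way split.
    threshold = (min(dists) + max(dists)) / 2
    return channels, [d > threshold for d in dists]


# Toy category: 10 channels, only channel 3 carries signal.
clean = [[1.0] * 10 for _ in range(4)]
for row, value in zip(clean, [0.0, 0.1, 0.0, 0.1]):
    row[3] = value
noisy = [1.0] * 10
noisy[3] = 5.0
toy_features = clean + [noisy]

kept_channels, noisy_flags = split_clean_noisy(toy_features, keep_ratio=0.1)
```

With a keep ratio of 0.1 only one of the ten toy channels survives, so the distance computation touches a tenth of the original feature vector. This is the computational saving the abstract alludes to.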
translated by Google Translate
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical image analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on either multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
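For readers unfamiliar with the K-fold cross-validation mentioned in the survey, the following is a minimal sketch of how the training set is partitioned into folds. The helper name `k_fold_indices` is hypothetical, and the sketch uses contiguous, unshuffled folds for clarity. In practice the indices would usually be shuffled or stratified first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
```

Each sample appears in exactly one validation fold, so every training case is used for validation once and for fitting in the remaining folds.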
Real-world data tends to be heavily imbalanced, which severely skews data-driven deep neural networks and makes Long-Tailed Recognition (LTR) a highly challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrained weights of ViTs always lead to unfair comparisons. In this paper, we systematically investigate the performance of ViTs in LTR and propose LiVT to train ViTs from scratch with LT data only. Observing that ViTs suffer from more severe LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised approaches. In addition, Binary Cross Entropy (BCE) loss, which performs notably well with ViTs, runs into difficulties in LTR. We further propose the balanced BCE (Bal-BCE) to ameliorate this, with strong theoretical grounding. Specifically, we derive the unbiased extension of Sigmoid and compensate extra logit margins to deploy it. Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs without any additional data and significantly outperforms comparable state-of-the-art methods, e.g., our ViT-B achieves 81.0% Top-1 accuracy on iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT.
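The notion of compensating logit margins in a sigmoid-based loss can be sketched as follows. The abstract does not give the Bal-BCE formula, so this sketch makes an assumption: each class logit is shifted by log(pi) - log(1 - pi), where pi is the class prior estimated from training counts. The function name `balanced_bce` is hypothetical, and the code illustrates the general logit-adjustment idea rather than the authors' implementation.

```python
import math


def sigmoid(x):
    # Plain sigmoid; adequate for the moderate logits used in this sketch.
    return 1.0 / (1.0 + math.exp(-x))


def balanced_bce(logits, targets, class_counts):
    """Mean binary cross-entropy over classes with a prior-based logit margin."""
    total = sum(class_counts)
    loss = 0.0
    for z, y, n in zip(logits, targets, class_counts):
        prior = n / total
        # Assumed margin: log-odds of the class prior (not confirmed by the source).
        margin = math.log(prior) - math.log(1.0 - prior)
        p = sigmoid(z + margin)
        loss -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return loss / len(logits)


# Two classes with a 90/10 imbalance and uninformative logits.
head_positive = balanced_bce([0.0, 0.0], [1, 0], [90, 10])
tail_positive = balanced_bce([0.0, 0.0], [0, 1], [90, 10])
```

Under this assumed margin, a tail-class sample with uninformative logits incurs a larger loss than a head-class sample, which pushes the network to raise tail-class logits during training. With equal class counts the margin vanishes and the loss reduces to ordinary BCE.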
High-fidelity facial avatar reconstruction from a monocular video is a significant research problem in computer graphics and computer vision. Recently, Neural Radiance Field (NeRF) has shown impressive novel view rendering results and has been considered for facial avatar reconstruction. However, the complex facial dynamics and missing 3D information in monocular videos raise significant challenges for faithful facial reconstruction. In this work, we propose a new method for NeRF-based facial avatar reconstruction that utilizes a 3D-aware generative prior. Unlike existing works that depend on a conditional deformation field for dynamic modeling, we propose to learn a personalized generative prior, which is formulated as a local and low-dimensional subspace in the latent space of a 3D-GAN. We propose an efficient method to construct the personalized generative prior based on a small set of facial images of a given individual. After learning, it allows for photo-realistic rendering with novel views, and face reenactment can be realized by navigating the latent space. Our proposed method is applicable to different driving signals, including RGB images, 3DMM coefficients, and audio. Compared with existing works, we obtain superior novel view synthesis results and faithful face reenactment performance.
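This paper proposes a novel video inpainting method. We make three main contributions. First, we extend previous Transformers with patch alignment by introducing Deformed Patch-based Homography (DePtH), which improves patch-level feature alignment without additional supervision and benefits challenging scenes with various deformations. Second, we introduce Mask Pruning-based Patch Attention (MPPA) to improve patch-wise feature matching by pruning out less essential features and using a saliency map. MPPA enhances the matching accuracy between warped tokens with invalid pixels. Third, we introduce a Spatial-Temporal weighting Adaptor (STA) module to attend accurately to spatial-temporal tokens under the guidance of the deformation factor learned from DePtH, especially for videos with agile motions. Experimental results demonstrate that our method outperforms recent methods both qualitatively and quantitatively and achieves a new state of the art.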
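Offline reinforcement learning has attracted great interest as a way to address the application challenges of traditional reinforcement learning. Offline reinforcement learning uses previously collected datasets to train agents without any interaction. To address the overestimation of out-of-distribution (OOD) actions, conservative estimation assigns low values to all inputs. Previous conservative estimation methods usually struggle to avoid the impact of OOD actions on Q-value estimates. In addition, these algorithms usually sacrifice some computational efficiency to achieve conservative estimation. In this paper, we propose a simple conservative estimation method, double conservative estimation (DCE), which uses two conservative estimation methods to constrain the policy. Our algorithm introduces a V-function to avoid errors on in-distribution actions while implicitly achieving conservative estimation. In addition, our algorithm uses a controllable penalty term that changes the degree of conservatism during training. We show theoretically how this method influences the estimation of OOD actions and in-distribution actions. Our experiments show separately that the two conservative estimation methods affect the estimation of all state-action pairs. DCE demonstrates state-of-the-art performance on D4RL.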
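Knowledge Distillation (KD) for Convolutional Neural Networks (CNNs) has been extensively studied as a way to boost the performance of small models. Recently, the Vision Transformer (ViT) has achieved great success on many computer vision tasks, and KD for ViT is also desirable. However, apart from output logit-based KD, feature-based KD methods designed for CNNs cannot be directly applied to ViT because of the huge structural gap. In this paper, we explore feature-based distillation for ViT. Based on the nature of feature maps in ViT, we design a series of controlled experiments and derive three practical guidelines for ViT feature distillation. Some of our findings are even opposite to the practices of the CNN era. Based on the three guidelines, we propose our feature-based method ViTKD, which brings consistent and considerable improvement to the student. On ImageNet-1K, we boost DeiT-Tiny from 74.42% to 76.06%, DeiT-Small from 80.55% to 81.95%, and DeiT-Base from 81.76% to 83.46%. Moreover, ViTKD and logit-based KD are complementary and can be applied together directly. This combination further improves the student's performance. Specifically, the students DeiT-Tiny, DeiT-Small, and DeiT-Base achieve 77.78%, 83.59%, and 85.41%, respectively. Code is available at https://github.com/yzd-v/cls_kd.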
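Knowledge Distillation (KD) has developed extensively and boosted various tasks. The classical KD method adds the KD loss to the original cross-entropy (CE) loss. We try to decompose the KD loss to explore its relation to the CE loss. Surprisingly, we find that it can be regarded as a combination of the CE loss and an extra loss that has the same form as the CE loss. However, we notice that the extra loss forces the student's relative probability to learn the teacher's absolute probability. Moreover, the sums of the two probabilities differ, which makes the loss hard to optimize. To address this issue, we revise the formulation and propose a distributed loss. In addition, we use the teacher's target output as the soft target and propose a soft loss. Combining the soft loss and the distributed loss, we propose a new KD loss (NKD). Furthermore, we smooth the student's target output and treat it as the soft target for training without a teacher, proposing a teacher-free new KD loss (TF-NKD). Our method achieves state-of-the-art performance on CIFAR-100 and ImageNet. For example, with ResNet-34 as the teacher, we boost the ImageNet Top-1 accuracy of ResNet-18 from 69.90% to 71.96%. In training without a teacher, MobileNet, ResNet-18, and SwinTransformer-Tiny achieve 70.04%, 70.76%, and 81.48%, which are 0.83%, 0.86%, and 0.30% higher than the baselines, respectively. Code is available at https://github.com/yzd-v/cls_kd.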
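As background for the classical KD objective that this abstract starts from, the following minimal sketch adds a temperature-scaled KL divergence between teacher and student distributions to the cross-entropy loss. This is the widely used classical formulation, not the NKD or TF-NKD loss proposed in the paper. The weighting `alpha`, the default temperature, and the function names are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_logits, label):
    """Standard CE loss against the ground-truth class index."""
    return -math.log(softmax(student_logits)[label])


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return temperature ** 2 * kl


def classical_kd_objective(student_logits, teacher_logits, label,
                           alpha=1.0, temperature=4.0):
    """CE loss plus a weighted KD term (alpha is an illustrative weight)."""
    return cross_entropy(student_logits, label) + alpha * kd_loss(
        student_logits, teacher_logits, temperature)


student = [1.0, 2.0, 0.5]
teacher = [0.5, 3.0, 0.2]
objective = classical_kd_objective(student, teacher, label=1)
```

When the student reproduces the teacher's logits exactly, the KD term is zero and the objective reduces to the CE loss alone.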
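Matching-based methods, especially those based on space-time memory, are significantly ahead of other solutions in semi-supervised video object segmentation (VOS). However, continuously growing and redundant template features lead to inefficient inference. To alleviate this, we propose a novel Sequential Weighted Expectation-Maximization (SWEM) network that greatly reduces the redundancy of memory features. Unlike previous methods, which only detect feature redundancy between frames, SWEM merges similar features both within and across frames by leveraging the sequential weighted EM algorithm. Further, adaptive weights for frame features give SWEM the flexibility to represent hard samples, improving the discriminability of the templates. In addition, the proposed method maintains a fixed number of template features in memory, which ensures stable inference complexity for the VOS system. Extensive experiments on the commonly used DAVIS and YouTube-VOS datasets verify the high efficiency (36 FPS) and high performance (84.3\% $\mathcal{J}\&\mathcal{F}$) of SWEM. Code is available at: https://github.com/lmm077/swem.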
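Image retrieval has become an increasingly appealing technique with broad prospects in multimedia applications, where deep hashing is the dominant branch for achieving low storage and efficient retrieval. In this paper, we carry out an in-depth investigation of metric learning in deep hashing to establish a powerful metric space in multi-label scenarios, where pair losses suffer from high computational overhead and convergence difficulty, while proxy losses are theoretically incapable of expressing the profound label dependencies and exhibit conflicts in the constructed hypersphere space. To address these problems, we propose a novel metric learning framework with a Hybrid Proxy-Pair Loss (HyP$^2$ Loss) that constructs an expressive metric space with efficient training complexity w.r.t. the whole dataset. The proposed HyP$^2$ Loss focuses on optimizing the hypersphere space through learnable proxies and on excavating data-to-data correlations of irrelevant pairs, which integrates the sufficient data correspondence of pair-based methods with the high efficiency of proxy-based methods. Extensive experiments on four standard multi-label benchmarks show that the proposed method outperforms the state of the art, is robust across different hash bit lengths, and achieves significant performance gains with faster and more stable convergence. Our code is available at https://github.com/jerryxu0129/hyp2-loss.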